This report explores a tidy dataset that contains almost 1,600 red wines with 11 variables on the chemical properties of the wine.
Some data wrangling has been completed to remove X from the dataset and to create quality.factor in the red wine dataset.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Our dataset consists of 13 variables, with 1,599 observations.
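The wrangling step described above might look like the following minimal sketch. The small data frame is a made-up stand-in for the real file (whose name is not given in this report), and `X` is assumed to be a row-index column. `quality.factor` is built with `factor()`, which matches the plain `Factor` shown in the structure output.

```r
# Toy stand-in for the red wine data; the real file is not named in this report.
wine <- data.frame(
  X       = 1:6,                                  # assumed row-index column
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 10.0),
  quality = c(5L, 5L, 5L, 6L, 5L, 7L)
)

# Drop the index column, which carries no chemical information.
wine$X <- NULL

# Keep the integer score and add a categorical copy for grouping and faceting.
wine$quality.factor <- factor(wine$quality)

dim(wine)      # rows, columns
str(wine)      # column types
summary(wine)  # quartiles, mean, and range per column
```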
## int [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Quality was determined by a minimum of 3 wine experts, who rated each wine between 0 (very bad) and 10 (excellent).
Quality has a normal distribution. Note that the values are integers and that the minimum and maximum values are 3 and 8, respectively.
A density plot is provided on the right to better visualize the distribution of the data. A density plot is a smoothed version of a histogram: it uses kernel smoothing to plot values, which smooths out the noise. Its peaks show where values are concentrated over the interval.
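As a rough illustration of how kernel smoothing turns discrete scores into a curve, the sketch below uses base R's `density()` on simulated scores. The simulated proportions and the `adjust = 2` bandwidth multiplier are assumptions chosen for illustration, not settings taken from this report.

```r
set.seed(1)
# Simulated integer scores, loosely shaped like the quality variable (an assumption).
scores <- sample(3:8, size = 500, replace = TRUE,
                 prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))

# Histogram-style counts: one bar per integer value.
counts <- table(scores)

# Kernel density estimate: a Gaussian kernel is centred on every observation
# and the kernels are averaged; `adjust` scales the default bandwidth.
dens <- density(scores, kernel = "gaussian", adjust = 2)

# The estimate integrates to roughly 1 over its evaluation grid.
area <- sum(dens$y) * diff(dens$x[1:2])

# The highest point of the curve marks where scores are most concentrated.
peak <- dens$x[which.max(dens$y)]
```

A larger bandwidth gives a smoother curve but can hide the fact that the underlying values are integers.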
## fixed.acidity volatile.acidity citric.acid
## Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 7.90 Median :0.5200 Median :0.260
## Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :15.90 Max. :1.5800 Max. :1.000
Fixed acidity has a normal distribution, while volatile acidity has a bimodal distribution. Citric acid appears to be multimodal with 3 peaks and a tail to the right. All plots have outliers.
Most acids in wine are fixed (nonvolatile) acids and are quantified as fixed acidity. The physicochemical test was for tartaric acid. Volatile acidity is the amount of acetic acid in wine, which can lead to an unpleasant vinegar taste at high levels. Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines. All measurements are in g / dm^3.
## free.sulfur.dioxide total.sulfur.dioxide sulphates
## Min. : 1.00 Min. : 6.00 Min. :0.3300
## 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.5500
## Median :14.00 Median : 38.00 Median :0.6200
## Mean :15.87 Mean : 46.47 Mean :0.6581
## 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.7300
## Max. :72.00 Max. :289.00 Max. :2.0000
Free sulfur dioxide, total sulfur dioxide, and sulphates all have a normal distribution with a tail to the right and outliers.
Free sulfur dioxide measures the free form of SO2 that exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine. Total sulfur dioxide measures the total amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. Sulphates are a wine additive that can contribute to SO2 levels, which act as an antimicrobial and antioxidant. Free and total sulfur dioxide are measured in mg / dm^3, and sulphates in g / dm^3.
## residual.sugar chlorides
## Min. : 0.900 Min. :0.01200
## 1st Qu.: 1.900 1st Qu.:0.07000
## Median : 2.200 Median :0.07900
## Mean : 2.539 Mean :0.08747
## 3rd Qu.: 2.600 3rd Qu.:0.09000
## Max. :15.500 Max. :0.61100
Both residual sugar and chlorides have normal distributions with outliers and tails to the right.
Residual sugar is the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram / liter, and wines with more than 45 grams / liter are considered sweet. Chlorides measure the amount of salt (sodium chloride) in the wine. All measurements are in g / dm^3.
## density pH
## Min. :0.9901 Min. :2.740
## 1st Qu.:0.9956 1st Qu.:3.210
## Median :0.9968 Median :3.310
## Mean :0.9967 Mean :3.311
## 3rd Qu.:0.9978 3rd Qu.:3.400
## Max. :1.0037 Max. :4.010
Both density and pH have normal distributions with a few outliers.
Density measures the density of the wine. All observations should be very close to 1 g / cm^3, the density of water. Variations are present due to differences in the percent alcohol and sugar content. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol has a normal distribution with a tail to the right.
Alcohol measures the percent alcohol content of the wine, in % by volume.
There are 1,599 red wines in the dataset with 11 input variables based on physicochemical tests (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol), and 1 output variable based on sensory data (quality). All variables are numeric except quality, which is an integer.
Other Observations:
The main feature in the dataset is quality. I’d like to determine which features are best for predicting the quality of red wines. I suspect that some combination of the input variables can be used to build a predictive model for quality.
Citric acid, residual sugar, chlorides, total sulfur dioxide, and alcohol will likely contribute to the quality of red wine.
I created a new variable called quality.factor, which is an ordered factor with 6 levels ("3", "4", "5", "6", "7", "8") derived from the quality variable. I decided to keep both quality and quality.factor because it gives me greater flexibility and saves time from having to convert from an ordered factor to an integer and back throughout the project.
There was one unusual distribution, citric acid, which was a multimodal distribution with 3 peaks. The dataset was already tidy, so there was little need for any data wrangling. However, I did remove the variable X as it was not needed for this exploration and created quality.factor for convenience.
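One reason keeping both columns saves time is that converting a factor back to its original integers is easy to get wrong. The sketch below uses a short made-up vector of scores to show the pitfall; it is an illustration, not code from this analysis.

```r
quality <- c(5L, 5L, 6L, 7L, 3L, 8L, 4L)   # made-up scores

# Categorical copy with the six observed score levels.
quality.factor <- factor(quality, levels = 3:8)

# Pitfall: as.integer() on a factor returns level positions (1..6), not scores.
codes <- as.integer(quality.factor)

# Going through the level labels recovers the original scores.
recovered <- as.integer(as.character(quality.factor))
```

Here `codes` starts at 1 for the lowest level ("3"), while `recovered` matches the original `quality` vector.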
Matrix of plots with wine dataset.
Plot of correlation values between quality and input variables.
These plots consider each variable against quality.
Fixed acidity
##
## Pearson's product-moment correlation
##
## data: quality and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
On the left is a histogram of fixed acidity faceted by quality, while on the right is a boxplot combined with a jittered scatter plot of quality and fixed acidity. After the plots is the Pearson’s product-moment correlation calculation. The correlation appears at the very bottom of the output, under sample estimates (cor = 0.1240516). Note that the significance level is set to 0.05 (5%).
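Output in this format is what `cor.test()` prints. The sketch below runs it on simulated stand-in data (the variable names mirror the report, but the values are made up) and pulls out the pieces discussed above.

```r
set.seed(42)
n <- 200
# Simulated stand-ins; the values are made up for illustration.
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
quality <- round(5.6 + 0.06 * (fixed.acidity - 8.3) + rnorm(n, sd = 0.8))

res <- cor.test(quality, fixed.acidity, method = "pearson")

r       <- unname(res$estimate)  # the value printed under "sample estimates: cor"
p_value <- res$p.value
ci      <- res$conf.int          # 95 percent confidence interval by default

# Decision at the 5% significance level.
significant <- p_value < 0.05
```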
Fixed acidity has a weak positive correlation with quality.
Volatile acidity
##
## Pearson's product-moment correlation
##
## data: quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
Volatile acidity has a moderate negative correlation with quality.
Citric acid
##
## Pearson's product-moment correlation
##
## data: quality and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
Citric acid has a weak positive correlation with quality.
Free sulfur dioxide
##
## Pearson's product-moment correlation
##
## data: quality and free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
Free sulfur dioxide has a weak negative correlation with quality.
Total sulfur dioxide
##
## Pearson's product-moment correlation
##
## data: quality and total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
Total sulfur dioxide has a weak negative correlation with quality.
Sulphates
##
## Pearson's product-moment correlation
##
## data: quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
Sulphates has a weak positive correlation with quality.
Residual sugar
##
## Pearson's product-moment correlation
##
## data: quality and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
Residual sugar does not have a statistically significant correlation with quality.
Chlorides
##
## Pearson's product-moment correlation
##
## data: quality and chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
Chlorides has a weak negative correlation with quality.
Density
##
## Pearson's product-moment correlation
##
## data: quality and density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
Density has a weak negative correlation with quality.
pH
##
## Pearson's product-moment correlation
##
## data: quality and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
pH has a weak negative correlation with quality.
Alcohol
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
Alcohol has a moderate positive correlation with quality.
Volatile acidity and citric acid
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Volatile acidity and citric acid have a moderate negative correlation.
Volatile acidity and sulphates
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and sulphates
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3060917 -0.2147125
## sample estimates:
## cor
## -0.2609867
Volatile acidity and sulphates have a weak negative correlation.
Volatile acidity and alcohol
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2488416 -0.1548020
## sample estimates:
## cor
## -0.202288
Volatile acidity and alcohol have a weak negative correlation.
Citric acid and sulphates
##
## Pearson's product-moment correlation
##
## data: citric.acid and sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2678558 0.3563278
## sample estimates:
## cor
## 0.31277
Citric acid and sulphates have a moderate positive correlation.
Residual sugar was the only input not to have a significant correlation with the main feature (quality). The remaining 10 inputs are listed by correlation strength:
Alcohol: 0.476
Volatile acidity: -0.391
Sulphates: 0.251
Citric acid: 0.226
Total sulfur dioxide: -0.185
Density: -0.175
Chlorides: -0.129
Fixed acidity: 0.124
pH: -0.058
Free sulfur dioxide: -0.051
Of all 10, alcohol had the strongest correlation of 0.48, while free sulfur dioxide had the weakest significant correlation of -0.051. Also of note is that there was a lack of wines in the range of 0.125 to 0.25 g / dm^3 of citric acid, but only for quality levels 7 and 8. This may be caused by different categories of high-quality red wine purposefully having higher or lower levels of citric acid. As you may recall, citric acid can add ‘freshness’ and flavor to wine. Another interesting point is that most high-quality red wines (quality of 7 or 8) seem to have sulphates levels between 0.65 and 0.82 g / dm^3.
There were 4 interesting relationships, 2 of which had weak negative correlations: volatile acidity and sulphates, and volatile acidity and alcohol. Citric acid and sulphates had a moderate positive correlation, while volatile acidity and citric acid had a moderate negative correlation.
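The ranking and the significance cut-off can be checked from the coefficients alone. The sketch below copies the Pearson coefficients from the test outputs above and recomputes each t statistic as r * sqrt(df / (1 - r^2)) with df = 1597, then the two-sided p-value. Variable names such as `ranked` are illustrative.

```r
# Pearson coefficients copied from the test outputs in this report.
r <- c(fixed.acidity = 0.1240516, volatile.acidity = -0.3905578,
       citric.acid = 0.2263725, free.sulfur.dioxide = -0.05065606,
       total.sulfur.dioxide = -0.1851003, sulphates = 0.2513971,
       residual.sugar = 0.01373164, chlorides = -0.1289066,
       density = -0.1749192, pH = -0.05773139, alcohol = 0.4761663)
df <- 1597  # n - 2 for n = 1599 wines

# t statistic and two-sided p-value for H0: true correlation equals 0.
t_stat  <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df)

# Keep the significant inputs and order them from strongest to weakest.
significant <- r[p_value < 0.05]
ranked <- significant[order(abs(significant), decreasing = TRUE)]
round(ranked, 3)
```

Ordering by absolute value matters here because a strong negative correlation (volatile acidity) is as informative as a strong positive one.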
The strongest relationship was a negative linear relationship between volatile acidity and citric acid, with a correlation of -0.55.
The plots are of volatile acidity, citric acid, and quality.
You can clearly see that higher quality wines tend to have low levels of volatile acidity, and slightly higher levels of citric acid, than lesser wines.
The plots are of volatile acidity, sulphates, and quality.
The trend of higher-quality wines having less volatile acidity still holds. Furthermore, you can also see that sulphates tend to be at higher levels in the better-quality wines.
The plots are of alcohol, volatile acidity, and quality.
The plot continues to show the trend of higher-quality wines having less volatile acidity, but it also shows that those same wines tend to have higher alcohol levels.
These plots are of citric acid, sulphates, and quality.
The trend presented in this plot is that higher quality wines tend to have more sulphates present than lower quality wines.
##
## Calls:
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid, data = wine)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + density, data = wine)
## m8: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + density + chlorides +
## fixed.acidity, data = wine)
## m10: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + total.sulfur.dioxide + density + chlorides +
## fixed.acidity + pH + free.sulfur.dioxide, data = wine)
##
## ==============================================================================================
## m2 m4 m6 m8 m10
## ----------------------------------------------------------------------------------------------
## (Intercept) 3.095*** 2.646*** -7.009 28.165 8.223
## (0.184) (0.201) (11.972) (15.083) (17.026)
## alcohol 0.314*** 0.309*** 0.305*** 0.268*** 0.291***
## (0.016) (0.016) (0.020) (0.021) (0.023)
## volatile.acidity -1.384*** -1.265*** -1.247*** -1.137*** -1.087***
## (0.095) (0.113) (0.116) (0.120) (0.121)
## sulphates 0.696*** 0.710*** 0.916*** 0.891***
## (0.103) (0.104) (0.112) (0.112)
## citric.acid -0.079 -0.093 -0.198 -0.174
## (0.104) (0.120) (0.145) (0.147)
## total.sulfur.dioxide -0.002*** -0.002*** -0.003***
## (0.001) (0.001) (0.001)
## density 9.820 -25.583 -3.864
## (11.931) (15.122) (17.385)
## chlorides -1.584*** -1.879***
## (0.408) (0.419)
## fixed.acidity 0.055** 0.013
## (0.017) (0.023)
## pH -0.482**
## (0.181)
## free.sulfur.dioxide 0.005*
## (0.002)
## ----------------------------------------------------------------------------------------------
## R-squared 0.317 0.336 0.344 0.356 0.360
## adj. R-squared 0.316 0.334 0.342 0.353 0.356
## sigma 0.668 0.659 0.655 0.650 0.648
## F 370.379 201.777 139.219 109.751 89.354
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1621.814 -1599.093 -1589.409 -1575.112 -1569.735
## Deviance 711.796 691.852 683.523 671.409 666.908
## AIC 3251.628 3210.186 3194.818 3170.224 3163.470
## BIC 3273.136 3242.448 3237.835 3223.995 3227.996
## N 1599 1599 1599 1599 1599
## ==============================================================================================
Linear model utilizing 10 out of the 11 input variables.
Model 10 (m10) is the final model, with the highest R-squared and the lowest AIC value.
R-squared tells us the proportion of variation in the dependent variable that has been explained by this model. Typically, we want to see an R-squared of 0.7 or greater, but we don’t necessarily discard a model based on a low R-squared value. It is better practice to look at the AIC and the prediction accuracy on a validation sample when deciding on the efficacy of a model. Akaike’s information criterion, or AIC (Akaike, 1974), measures the goodness of fit of an estimated statistical model while penalizing the number of parameters, and can also be used for model selection. The lower the AIC value, the better.
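To make the comparison concrete, the sketch below fits three nested linear models to simulated data (not the wine data) and tabulates R-squared, adjusted R-squared, and AIC using base R only. The predictor `noise.var` is a hypothetical variable deliberately unrelated to the response.

```r
set.seed(7)
n <- 300
# Simulated predictors and response; not the wine data.
alcohol          <- rnorm(n, 10.4, 1.1)
volatile.acidity <- rnorm(n, 0.53, 0.18)
noise.var        <- rnorm(n)   # hypothetical predictor unrelated to the response
quality <- 3 + 0.3 * alcohol - 1.3 * volatile.acidity + rnorm(n, sd = 0.65)
sim <- data.frame(quality, alcohol, volatile.acidity, noise.var)

m1 <- lm(quality ~ alcohol, data = sim)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = sim)
m3 <- lm(quality ~ alcohol + volatile.acidity + noise.var, data = sim)

models <- list(m1, m2, m3)
comparison <- data.frame(
  model     = c("m1", "m2", "m3"),
  r.squared = sapply(models, function(m) summary(m)$r.squared),
  adj.r2    = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC       = sapply(models, AIC)
)
comparison
```

R-squared can only stay the same or rise as predictors are added to a nested model, which is why a penalized measure such as AIC or adjusted R-squared is the safer basis for choosing between models.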
A few relationships observed were:
When comparing volatile acidity, alcohol, and quality, it became evident that higher-quality wines tend to have higher alcohol content and lower volatile acidity.
One surprising feature was that higher-quality wines tended to have slightly higher levels of sulphates, which are a salt. This is surprising because it breaks my preconception by suggesting that saltier wines may be of higher quality. That being said, by more salt, we are talking about a tenth (0.1) of a g / dm^3, which is a very small amount.
We did create a linear model. The best-fitting model had an R-squared value of 0.360 and an AIC value of 3163. The largest limitation this model faces is that quality was determined by taste testers. While expert taste testers know a lot more about wine quality than I do, this still introduces an element of human error. Furthermore, quality in and of itself may not directly relate to an enjoyable wine for the average consumer.
This model is statistically significant, with a p-value less than 0.05. While we would prefer a higher R-squared value, model 10 (m10) has the highest value available.
This plot is of the correlation values between quality and all of the input variables.
It was chosen as one of the 3 final plots because it enables the reader to grasp the full range of the correlation values between quality and the input variables. A correlation is a statistical measurement that suggests the level of linear dependence between 2 variables. In other words, we can use it as a rough measurement of which variables we should pay the most attention to (such as alcohol) and which variables will probably be of little importance (such as residual sugar). These values can range from -1 to +1, and the closer the value is to 0, the weaker the correlation is. Furthermore, we consider any correlation with a p-value greater than 0.05 to be statistically insignificant.
This is a jitter plot of alcohol content by quality level with a boxplot and linear regression line overlay.
It was chosen because it clearly presents the relationship between alcohol and quality, the strongest correlation of all the input variables. The jitter plot is a scatter plot that adds a small amount of randomness to the discrete position (quality), which helps to avoid overplotting and shows concentrations more clearly. The box plot is a standardized way of displaying the distribution of the data, and the linear regression line shows the linear trend that the correlation value measures.
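The report does not show its plotting code, so the following is only a base-R sketch of the same three layers (jittered points, box plots, regression line) on simulated data. The jitter amount of 0.25 and the styling arguments are assumptions.

```r
set.seed(3)
quality <- sample(3:8, 100, replace = TRUE)       # discrete positions (simulated)
alcohol <- 9 + 0.3 * quality + rnorm(100, sd = 0.8)

# jitter() adds uniform noise in [-amount, amount] to each value.
quality.jittered <- jitter(quality, amount = 0.25)

# Draw to a null PDF device so the sketch runs without opening a window.
pdf(NULL)
plot(quality.jittered, alcohol, pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "quality", ylab = "alcohol (% by volume)")
# One box per quality level, drawn on top of the points at the integer positions.
boxplot(alcohol ~ quality, at = sort(unique(quality)), add = TRUE,
        boxwex = 0.5, col = "transparent", axes = FALSE)
abline(lm(alcohol ~ quality), col = "red")        # linear trend
dev.off()
```

Only the horizontal position is jittered, so the alcohol values themselves are plotted exactly.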
This is a jitter and smooth line plot of volatile acidity, alcohol, and quality.
It was chosen because it allows us to see the multivariable trends between volatile acidity, alcohol, and quality. The jitter plot allows us to see the concentrations and quality levels, thus enabling us to visually determine if any trends are present. The smooth line plot is another tool that is very effective in showing what the trend actually looks like. However, due to its nature, it can overgeneralize, although that can be mitigated through the use of the confidence bands.
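The report does not say which smoother was used. As one possibility, the sketch below fits a local regression with `loess()` on simulated data and builds an approximate pointwise 95% band from the standard errors returned by `predict()`. The `span` value and the simulated relationship are assumptions.

```r
set.seed(11)
n <- 200
alcohol <- runif(n, 8.5, 14)                              # simulated
volatile.acidity <- 0.9 - 0.03 * alcohol + rnorm(n, sd = 0.15)

# Local regression smoother; span controls how much data each local fit uses.
fit <- loess(volatile.acidity ~ alcohol, span = 0.75)

# Evaluate the smooth on a grid that stays inside the observed range.
grid <- data.frame(alcohol = seq(9, 13.5, length.out = 50))
pred <- predict(fit, newdata = grid, se = TRUE)

# Approximate pointwise 95% band around the smooth.
crit  <- qt(0.975, pred$df)
lower <- pred$fit - crit * pred$se.fit
upper <- pred$fit + crit * pred$se.fit
```

A wide band signals regions where few observations support the curve, which is where a smooth line is most likely to overgeneralize.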
The Red Wine Quality dataset was a tidy dataset with almost 1,600 wines, 11 physicochemical (input) variables, and 1 sensory (output) variable. The majority of the wines were of a normal quality, with very few being considered of high or low quality. This is the greatest limitation in the dataset because it made it much more difficult to identify chemical trends that could be used to predict high- or low-quality wines. The majority of the wines were not considered sweet, were considered acidic (pH score between 3.2 and 3.4), and had an alcohol content between 9.5 and 11.1%. The majority of the input variables had weak correlations with quality, with the strongest correlation being alcohol, followed by volatile acidity. Residual sugar was the only variable to have a statistically insignificant correlation with quality. It was found that higher-quality wines were more likely to have higher alcohol content and lower volatile acidity levels. Furthermore, higher-quality wines also tended to have slightly higher citric acid and sulphates levels. Lastly, when comparing alcohol content, volatile acidity, and quality, you can clearly see the concentration of higher-quality wines. All of these findings indicate that alcohol and volatile acidity would be the best 2 indicators of wine quality, a finding supported by the linear regression model, in which these 2 variables were highly significant.
When starting this analysis, my first thought was that this would be easy. Primarily, I would only need to consider correlations to quality, make some nice plots, and submit. What I was not expecting was the vast number of low correlations to quality, and the intercorrelation between input variables, such as free and total sulfur dioxide. This called for a more in-depth exploration of the data. The box plots and linear regression line overlays worked well for viewing this behavior. Furthermore, the smooth lines in the multivariable plots helped to spot trends. However, that brings me to the last issue experienced: discrete variables can be rather difficult to analyze. If you collapse a discrete variable (like quality) into just 2 ordered levels, then you can begin to overgeneralize. However, if you do not, then it may be rather hard to identify any trends.
As already mentioned, the greatest limitation was the lack of high- and low-quality wines in the dataset. However, another limitation is that the quality was determined by human sensory data. While measures were taken to limit the human error, averaging the quality ratings from at least 3 separate testers, it will always remain present in the dataset. The best way to improve this analysis is to obtain more data entries of high- and low-quality wines. This would balance out the quality variable, allowing for more accurate trends to be identified regarding high- and low-quality wines.